MEMORANDUM FOR DISTRIBUTION LIST

Subj: Center for Naval Analyses Research Memorandum 88-236

Encl: (1) CNA Research Memorandum 88-236, A Maximum-Likelihood
Procedure for Developing a Common Metric in Item-Response
Theory, by D. R. Divgi, Dec 1988

1.  Enclosure  (1)  is  forwarded  as  a  matter  of  possible  interest. 

2. A computerized adaptive version of the Armed Services Vocational
Aptitude Battery has been developed. This development required
application of item-response theory to two different item pools
administered to non-equivalent samples. In such cases, the ability
scales in the two samples must be placed on a common metric. A
maximum-likelihood procedure for doing so is presented and illustrated
with examples.
ABSTRACT

Because the ability scale in item-response theory is arbitrary, if two
item pools are calibrated in two different samples, their parameter
estimates must be placed on a common metric using items administered in
both calibrations. In this memorandum, a maximum-likelihood procedure
for doing so is derived and illustrated.
EXECUTIVE SUMMARY


The  Department  of  Defense  has  developed  a  computerized  adaptive 
testing  (CAT)  version  of  the  Armed  Services  Vocational  Aptitude  Battery 
(ASVAB)  in  the  Accelerated  CAT-ASVAB  Project  (ACAP).  Use  of  the  CAT 
requires  a  large  pool  of  items  for  each  subtest.  For  Arithmetic 
Reasoning  and  Word  Knowledge,  it  became  necessary  to  supplement  the 
original  ACAP  pool  with  items  from  the  experimental  CAT-ASVAB  system 
developed  earlier.  This  memorandum  presents  a  maximum-likelihood 
procedure  for  performing  some  calculations  needed  to  merge  the  two  item 
pools. 

CAT-ASVAB uses the three-parameter logistic model of item-response
theory (IRT). In this model, each person is characterized by an ability
parameter θ and each test item by three parameters a, b, and c. The
quantities a, b, and c are called the discrimination, difficulty, and
guessing parameters of the item.

The metric of the θ scale is arbitrary. One can transform θ, a,
and b simultaneously in such a way that the probability of answering an
item correctly remains unchanged for all persons and items. This
creates a practical problem. Suppose two tests are calibrated (that is,
their item parameters are estimated) using different samples of
examinees. One set of item parameters must be transformed to the metric
of the other before the two sets of estimates can be used together.
This requires that the tests have some items in common.

Currently  available  procedures  for  determining  a  transformation 
define  a  criterion  function  and  minimize  it  to  estimate  the 
transformation  parameters.  Although  reasonable,  the  criterion  function 
is  not  based  on  any  principle  or  related  to  the  larger  problem  of 
estimating  item  parameters. 

Item parameters are usually estimated by the method of maximum
likelihood. The same approach can be extended to transform the metric
of one calibration to that of another. The method is illustrated in
this memorandum using four forms each of five ASVAB subtests, which were
included in calibrations of both the experimental and ACAP item pools.
Results using this method are found to be close to those of an earlier
method devised by Stocking and Lord.

Maximum  likelihood  is  a  viable  procedure  that  can  be  used  with  item 
pools  for  future  versions  of  CAT-ASVAB.  It  requires  less  computation 
than  the  Stocking-Lord  method  and  makes  use  of  information  about 
standard  errors  of  parameter  estimates. 
INTRODUCTION


The Armed Services Vocational Aptitude Battery (ASVAB) is used to
select and classify enlisted personnel. It contains ten subtests:
General Science (GS), Arithmetic Reasoning (AR), Word Knowledge (WK),
Paragraph Comprehension (PC), Numerical Operations (NO), Coding Speed
(CS), Auto and Shop Information (AS), Mathematics Knowledge (MK),
Mechanical Comprehension (MC), and Electronics Information (EI). The
Verbal (VE) subtest is defined as the sum of WK and PC.

The Department of Defense has developed a computerized adaptive
testing (CAT) version of the ASVAB in the Accelerated CAT-ASVAB Project
(ACAP). Use of the CAT requires a large pool of items for each
subtest. For Arithmetic Reasoning and Word Knowledge, it became
necessary to supplement the original ACAP pool with items from the
experimental CAT-ASVAB system developed earlier [1]. The purpose of
this memorandum is to present a maximum-likelihood procedure for
performing some calculations needed to merge the two item pools.

METRIC  TRANSFORMATION  IN  IRT 

CAT-ASVAB uses the three-parameter logistic model of item-response
theory (IRT). In this model, each person is characterized by an ability
parameter θ and each test item by three parameters a, b, and c. The
probability that a person of ability θ will answer an item correctly is
given by

    P(\theta) = c + \frac{1 - c}{1 + \exp\{1.7a(b - \theta)\}} .

The quantities a, b, and c are called, respectively, the discrimination,
difficulty, and guessing parameters of the item.
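
As a minimal sketch of the response function above, the following Python function evaluates P(θ) for one item. The function name and the item values (a = 1.2, b = 0.5, c = 0.2) are hypothetical and chosen only for illustration; they do not come from the CAT-ASVAB pools.

```python
import math

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(1.7 * a * (b - theta)))

# Hypothetical item parameters, for illustration only.
a, b, c = 1.2, 0.5, 0.2
for theta in (-2.0, 0.5, 2.0):
    print(theta, round(p_correct(theta, a, b, c), 3))
```

At θ = b the logistic term equals one half, so the probability is c + (1 - c)/2, and for very low abilities it approaches the guessing parameter c.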

The metric of the θ scale is arbitrary. It is possible to make a
linear transformation of θ, a, and b in such a way that P(θ) remains
unchanged. Suppose two tests are calibrated (that is, their item
parameters are estimated) using samples of examinees from different
populations. One set of item parameters must be transformed to the
metric of the other before a useful analysis (e.g., equating) can be
performed. This requires that the tests have at least two items in
common.

Let estimates from the second calibration be transformed to the
metric of the first. Transformed estimates of discrimination (a) and
difficulty (b) parameters are given for each item by

    a_2^* = a_2 / A ,    (1)

    b_2^* = A b_2 + B ,    (2)

and for each person by

    \theta_2^* = A \theta_2 + B ,    (3)

where * indicates a transformed value and the subscript refers to the
calibration. It is easy to verify that the probability P(θ) is
invariant under such transformations.
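
The verification can be written out in one line. The sketch below assumes the usual linear form of the transformation (discrimination divided by A, difficulty and ability mapped to A times the old value plus B); the guessing parameter c is left untouched, so only the exponent of the logistic term needs checking.

```latex
a_2^* \,(b_2^* - \theta_2^*)
  = \frac{a_2}{A}\,\bigl[(A b_2 + B) - (A \theta_2 + B)\bigr]
  = a_2 \,(b_2 - \theta_2)
```

Because the exponent 1.7a(b - θ) is unchanged and c is unchanged, P(θ) takes the same value before and after the transformation.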

Recent procedures for estimating A and B are found in Stocking and
Lord [2] and Divgi [3]. These methods estimate the parameters by
minimizing a criterion function, which is a weighted sum of squares.
The procedures are ad hoc in that the criterion functions, although
reasonable, are not based on any principle. The purpose of this
memorandum is to relate the estimation of A and B to the larger problem
of estimating item parameters. This leads to a procedure that, like
parameter estimation, is based on the principle of maximizing a
likelihood function.

THE  MAXIMUM-LIKELIHOOD  APPROACH 

No metric transformation would be necessary if a single joint
calibration were performed using both samples at once. The two
calibrations provide independent sets of parameter estimates for each
item. If one tries to combine them so as to approximate the single set
of estimates that a joint calibration would yield, a procedure for
metric transformation emerges.

Ideally, all three parameters should be included in the
calculations. However, the guessing parameter c is often difficult to
estimate with the sample sizes available in practice. Wainer and
Thissen [4] have shown that theoretical standard errors of the estimates
of c can be very high for easy items. For this reason, compromises have
to be made: data on different items must be pooled or Bayesian prior
distributions must be used to keep the estimates reasonable. Standard
errors of these estimates are much smaller than their theoretical
values. Hence, given that the c parameter is not estimated by pure
maximum likelihood in the original calibrations, no direct use of it is
made in the theory given below.

Let vectors

    p_1 = (a_1 \;\; b_1)'

and

    p_2 = (a_2^* \;\; b_2^*)'

denote the two pairs of estimates for an item common to both tests.
They maximize log likelihoods L1 and L2 in the two samples. Now suppose
a joint calibration is performed and the results transformed to the
metric of calibration 1. Denote these estimates by

    p = (a \;\; b)' ,

which maximize L1 + L2. Therefore p can be calculated approximately
from p1 and p2.

If the samples are large, estimates of item parameters are close to
their true values. Therefore, if transformation parameters A and B are
chosen properly, p1 and p2 are almost equal. In their neighborhood, the
log likelihoods of responses observed in the samples are quadratic
functions of the parameters. Denote the information matrices, i.e.,
2 × 2 matrices of negative second derivatives of log likelihood, by I1
and I2. (Formulas for computing them in the three-parameter logistic
model are given by Lord [5].) Let L1m and L2m be maximum values of log
likelihoods of responses on the common items in calibrations 1 and 2.
For any parameter vector p near p1 and p2, log likelihood L1 + L2 for
the two samples combined is given by

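    L_{1m} + L_{2m} - L_1 - L_2 = \frac{1}{2} \sum \left[ (p - p_1)' I_1 (p - p_1) + (p - p_2)' I_2 (p - p_2) \right] ,    (4)

where the sum is taken over all items. Minimizing this quantity over p
for a single item leads to a linear equation for p. Its solution yields

    p - p_2 = (I_1 + I_2)^{-1} I_1 (p_1 - p_2) .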

A little matrix manipulation shows that the minimum value of the item's
contribution to equation (4) is

    \frac{1}{2} (p_1 - p_2)' S (p_1 - p_2) ,    (5a)

where

    S = I_1 - I_1 (I_1 + I_2)^{-1} I_1 .    (5b)

Multiplication verifies that

    S = (I_1^{-1} + I_2^{-1})^{-1} .
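
The manipulation can be spelled out as follows. This is a sketch rather than the memorandum's own working: the shorthand d and u is introduced here for convenience, and the symmetry of the information matrices is used throughout. Only the bracketed quadratic form for one item is tracked, so any constant factor in front of the sum carries through unchanged.

```latex
\begin{aligned}
d &= p_1 - p_2, \qquad u = p - p_2 = (I_1 + I_2)^{-1} I_1 d, \\
(u - d)' I_1 (u - d) + u' I_2 u
  &= u' (I_1 + I_2) u - 2 u' I_1 d + d' I_1 d \\
  &= d' \bigl[ I_1 - I_1 (I_1 + I_2)^{-1} I_1 \bigr] d = d' S d, \\
S &= I_1 (I_1 + I_2)^{-1} \bigl[ (I_1 + I_2) - I_1 \bigr]
   = I_1 (I_1 + I_2)^{-1} I_2 \\
  &= \bigl[ I_2^{-1} (I_1 + I_2) I_1^{-1} \bigr]^{-1}
   = \bigl( I_1^{-1} + I_2^{-1} \bigr)^{-1}.
\end{aligned}
```

The last form shows that S is the inverse of the sum of the two sampling covariance matrices, which is why the criterion weights each difference p1 - p2 by its precision.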


Thus, for any given A and B, after minimizing over p for each common
item,

    L_{1m} + L_{2m} - L_1 - L_2 = \frac{1}{2} \sum (p_1 - p_2)' (I_1^{-1} + I_2^{-1})^{-1} (p_1 - p_2) .    (6)


Minimization of this quantity over A and B yields maximum-likelihood
estimates of the transformation parameters.
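
The following Python sketch evaluates a criterion of this form and minimizes it by a coarse grid search. It is an illustration under stated assumptions, not the program used for this memorandum: the function names, the synthetic items, and the made-up information matrices are hypothetical; the second calibration's information matrix is rescaled by the Jacobian of the transformation, which is a standard result but is not spelled out in the text; and a practical implementation would use a proper numerical optimizer instead of a grid.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as ((m00, m01), (m10, m11))."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return ((s / det, -q / det), (-r / det, p / det))

def add2(m, n):
    """Element-wise sum of two 2x2 matrices."""
    return tuple(tuple(m[i][j] + n[i][j] for j in range(2)) for i in range(2))

def quad(d, m):
    """Quadratic form d' m d for a 2-vector d."""
    return sum(d[i] * m[i][j] * d[j] for i in range(2) for j in range(2))

def criterion(A, B, items):
    """Sum over common items of d' (I1^-1 + I2*^-1)^-1 d, with d = p1 - p2*."""
    total = 0.0
    for (a1, b1), info1, (a2, b2), info2 in items:
        # Difference between calibration 1 and transformed calibration 2.
        d = (a1 - a2 / A, b1 - (A * b2 + B))
        # Assumption: I2 is carried to the new metric through the Jacobian
        # diag(1/A, A) of the transformation, i.e. I2* = diag(A, 1/A) I2 diag(A, 1/A).
        s = (A, 1.0 / A)
        info2_star = tuple(
            tuple(s[i] * info2[i][j] * s[j] for j in range(2)) for i in range(2)
        )
        weight = inv2(add2(inv2(info1), inv2(info2_star)))
        total += quad(d, weight)
    return total

def grid_search(items):
    """Coarse search over A in [0.80, 1.50] and B in [-0.60, 0.20], step 0.01."""
    best = None
    for i in range(80, 151):
        for j in range(-60, 21):
            A, B = i / 100, j / 100
            value = criterion(A, B, items)
            if best is None or value < best[0]:
                best = (value, A, B)
    return best[1], best[2]

# Synthetic, error-free common items (hypothetical values, not ASVAB data).
TRUE_A, TRUE_B = 1.15, -0.30
INFO = ((400.0, -50.0), (-50.0, 900.0))  # made-up positive-definite matrix
items = []
for a, b in [(0.8, -1.0), (1.2, 0.0), (1.5, 0.7), (1.0, 1.4)]:
    # Calibration-2 values are chosen so that transforming them recovers (a, b).
    items.append(((a, b), INFO, (a * TRUE_A, (b - TRUE_B) / TRUE_A), INFO))

A_hat, B_hat = grid_search(items)
print(A_hat, B_hat)
```

Because the synthetic estimates contain no sampling error, the criterion is essentially zero at the generating values of A and B and positive elsewhere, so the search recovers them. With real estimates the minimum is positive and reflects the disagreement between the two calibrations.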

The argument leading to expression (6) is strictly correct only if
true abilities are known. In practice the maximum-likelihood estimates
of θ are used instead [5], or the likelihood is marginalized by
integrating over the distribution of ability (Bock and Aitkin [6]). It
does not matter how the likelihood function is calculated; if it yields
satisfactory estimates of item parameters, it can be used to compute the
information matrices in expression (6).

The criterion function (6) is the same as in Divgi's minimum
chi-square method [3]. In addition to supplying a theoretical basis for
the minimum chi-square method, the maximum-likelihood approach shows how
the guessing parameter c should be handled. Theoretical information
functions involving derivatives with respect to c often greatly
overestimate the true standard errors; hence they are excluded from the
theory. Estimates of c do not appear directly in the criterion
function; however, they are used in computing 2 × 2 information matrices
for a and b.

ILLUSTRATION 

For each subtest in CAT-ASVAB, the item pool was divided into
booklets. Each booklet was administered to a large sample of military
applicants, along with an operational form of the ASVAB. Hence the item
calibration provided parameter estimates for operational ASVAB items as
well as for the CAT pool. This design was used for the ACAP version of
CAT-ASVAB [7] and also for the earlier experimental version [8].

ASVAB forms 9A, 9B, 10A, and 10B were used operationally in both
calibrations. Therefore, two sets of parameter estimates are available
for each form. Estimates for all subtests in the ACAP calibration and
for five subtests in the experimental calibration have been provided to
the Center for Naval Analyses by the Navy Personnel Research and
Development Center. These five subtests are GS, AR, WK, PC, and MK.

The maximum-likelihood and Stocking-Lord [2] procedures were
applied to each form of each of the five subtests. Information matrices
needed in the maximum-likelihood method were computed under the
assumption that the ability distribution was standard normal in each
calibration. The same assumption was made while sampling θ values
needed in the Stocking-Lord method. The normality assumption is
reasonable and used frequently (for example, in the calibration of the
ACAP item pool [7]).
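
One way such an information matrix could be computed is sketched below. It uses the standard expected (Fisher) information for a and b in the three-parameter logistic model, with c treated as known, and averages it over a standard normal ability distribution by simple numerical integration. The function name, the integration grid, the sample size, and the item values are assumptions made for illustration; the formulas actually used may differ in detail from this sketch.

```python
import math

def info_matrix_ab(a, b, c, n_examinees, n_points=241, lo=-6.0, hi=6.0):
    """Expected 2x2 information matrix for (a, b) with c held fixed,
    averaged over a standard normal ability distribution."""
    step = (hi - lo) / (n_points - 1)
    iaa = iab = ibb = 0.0
    for k in range(n_points):
        theta = lo + k * step
        density = math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        psi = 1.0 / (1.0 + math.exp(1.7 * a * (b - theta)))  # logistic term
        p = c + (1.0 - c) * psi                               # P(theta)
        # Partial derivatives of P(theta) with respect to a and b.
        dp_da = 1.7 * (1.0 - c) * (theta - b) * psi * (1.0 - psi)
        dp_db = -1.7 * a * (1.0 - c) * psi * (1.0 - psi)
        # Weight: examinees expected near theta, divided by the binomial variance.
        w = n_examinees * density * step / (p * (1.0 - p))
        iaa += w * dp_da * dp_da
        iab += w * dp_da * dp_db
        ibb += w * dp_db * dp_db
    return ((iaa, iab), (iab, ibb))

# Hypothetical item and sample size, for illustration only.
matrix = info_matrix_ab(a=1.2, b=0.5, c=0.2, n_examinees=1000)
print(matrix)
```

The estimate of c enters through P(θ) and its derivatives even though c is not itself a row or column of the matrix, which matches the way c is handled in the criterion function.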

The results are presented in table 1. For any given subtest, the
results vary little from one form to another and from one method to the
other. This is to be expected since all eight values (e.g., for A) are
estimates of the same quantity.
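
The size of this variation can be checked directly from the estimates of A in table 1. The short Python sketch below transcribes those values (four forms per method, in the order 9A, 9B, 10A, 10B) and computes, for each subtest, the mean under each method and the spread across all eight values. The dictionary layout and function names are illustrative choices, not part of the original analysis.

```python
# Estimates of A transcribed from table 1; forms in the order 9A, 9B, 10A, 10B.
# "ML" = maximum likelihood, "SL" = Stocking-Lord.
A_ESTIMATES = {
    "GS": {"ML": [1.17, 1.16, 1.09, 1.13], "SL": [1.14, 1.09, 1.04, 1.19]},
    "AR": {"ML": [1.12, 1.17, 1.12, 1.16], "SL": [1.11, 1.14, 1.13, 1.13]},
    "WK": {"ML": [1.14, 1.24, 1.10, 1.16], "SL": [1.15, 1.22, 1.17, 1.13]},
    "PC": {"ML": [0.87, 1.01, 0.96, 1.05], "SL": [0.99, 1.03, 1.06, 1.11]},
    "MK": {"ML": [1.26, 1.32, 1.29, 1.27], "SL": [1.25, 1.29, 1.25, 1.30]},
}

def spread(subtest):
    """Largest minus smallest of the eight estimates of A for a subtest."""
    values = A_ESTIMATES[subtest]["ML"] + A_ESTIMATES[subtest]["SL"]
    return max(values) - min(values)

def method_mean(subtest, method):
    """Mean estimate of A over the four forms for one method."""
    values = A_ESTIMATES[subtest][method]
    return sum(values) / len(values)

for name in A_ESTIMATES:
    print(name,
          round(method_mean(name, "ML"), 3),
          round(method_mean(name, "SL"), 3),
          round(spread(name), 2))
```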

The assumptions of the maximum-likelihood approach are reasonable,
and its theory is simple. It is only to be expected that its results
should agree with the more established Stocking-Lord procedure. The
illustration serves primarily as a check on the computer program. It is
much harder to decide whether one method is clearly preferable to the
other. To do so would require extensive data analyses, which are beyond
the scope of this paper. However, as pointed out in [3], the chi-square
method involves much simpler computations and, unlike the Stocking-Lord
method, makes use of information about the sampling errors of the
estimates of item parameters.
TABLE 1

RESULTS OF MAXIMUM LIKELIHOOD AND STOCKING-LORD PROCEDURES

                      Maximum
                     likelihood        Stocking-Lord
Subtest    Form       A       B         A       B

GS         9A       1.17    -.27      1.14    -.21
GS         9B       1.16    -.28      1.09    -.20
GS         10A      1.09    -.28      1.04    -.19
GS         10B      1.13    -.23      1.19    -.26

AR         9A       1.12    -.30      1.11    -.30
AR         9B       1.17    -.31      1.14    -.31
AR         10A      1.12    -.27      1.13    -.30
AR         10B      1.16    -.35      1.13    -.34

WK         9A       1.14    -.27      1.15    -.30
WK         9B       1.24    -.34      1.22    -.33
WK         10A      1.10    -.30      1.17    -.35
WK         10B      1.16    -.34      1.13    -.33

PC         9A       0.87    -.10      0.99    -.19
PC         9B       1.01    -.26      1.03    -.28
PC         10A      0.96    -.16      1.06    -.25
PC         10B      1.05    -.19      1.11    -.29

MK         9A       1.26    -.45      1.25    -.42
MK         9B       1.32    -.51      1.29    -.45
MK         10A      1.29    -.50      1.25    -.43
MK         10B      1.27    -.45      1.30    -.45